An Analysis of a Mandarin-English Code-switching Speech Corpus: SEAME
Abstract
SEAME (South East Asia Mandarin-English) is a 30-hour spontaneous Mandarin-English code-switching speech corpus recorded from speakers in Singapore and Malaysia. In this paper, we report a series of analyses of the recording, processing time, and voice activity rate (VAR) of the speech recording, transcription, validation, and language boundary labeling processes. In addition, we describe the duration of monolingual segments within code-switching utterances and analyze the speakers' language-switching behavior during conversation. The results show that 80% of English and 72% of Mandarin monolingual segments in code-switching utterances are shorter than one second. In over 80% of cases, speakers switch language directly, without any short pause or discourse particle between the two adjacent languages.
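The segment-duration statistic reported above can be illustrated with a minimal sketch. It assumes each monolingual segment is available as a (language, start, end) tuple taken from the language boundary labels. The tuple layout, the `share_shorter_than` helper, and the sample values are all hypothetical and are not taken from the SEAME data or tooling.

```python
from collections import defaultdict

# Hypothetical input: (language, start_sec, end_sec) for each monolingual
# segment inside code-switching utterances. Illustrative values, not SEAME data.
segments = [
    ("en", 0.00, 0.40),
    ("zh", 0.40, 1.90),
    ("en", 1.90, 2.50),
    ("zh", 2.50, 3.10),
    ("en", 3.10, 4.60),
    ("zh", 4.60, 5.00),
]


def share_shorter_than(segments, threshold_sec=1.0):
    """Return, per language, the fraction of segments shorter than threshold_sec."""
    total = defaultdict(int)
    short = defaultdict(int)
    for lang, start, end in segments:
        total[lang] += 1
        if end - start < threshold_sec:
            short[lang] += 1
    return {lang: short[lang] / total[lang] for lang in total}


shares = share_shorter_than(segments)
for lang, share in sorted(shares.items()):
    print(f"{lang}: {share:.0%} of segments are shorter than 1 s")
```

Applied to real boundary labels, the same per-language ratio would correspond to the percentages quoted in the abstract, assuming the segments are defined the same way as in the paper.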
Related resources
SEAME: a Mandarin-English code-switching speech corpus in south-east asia
In Singapore and Malaysia, people often speak a mixture of Mandarin and English within a single sentence. We call such sentences intra-sentential code-switch sentences. In this paper, we report on the development of a Mandarin-English code-switching spontaneous speech corpus: SEAME. The corpus is developed as part of a multilingual speech recognition project and will be used to examine how Manda...
Features for factored language models for code-Switching speech
This paper presents investigations of features which can be used to predict Code-Switching speech. For this task, factored language models are applied and implemented into a state-of-the-art decoder. Different possible factors, such as words, part-of-speech tags, Brown word clusters, open class words and open class word clusters are explored. We find that Brown word clusters, part-of-speech tag...
A Mandarin-English Code-Switching Corpus
Generally, existing monolingual corpora are not suitable for large vocabulary continuous speech recognition (LVCSR) of code-switching speech. The motivation of this paper is to study the rules and constraints that code-switching follows and to design a corpus for the code-switching LVCSR task. This paper presents the development of a Mandarin-English code-switching corpus. This corpus consists of four pa...
Functions of Code-Switching Strategies among Iranian EFL Learners and Their Speaking Ability Improvement through Code-Switching
This study investigated the impact of code-switching on the speaking ability of Iranian low-proficiency EFL learners. Moreover, it was an attempt to show what functions existed behind the code-switching strategies used by the EFL learners. To this end, 60 male and female Iranian EFL learners aged between 20 and 30 participated in the study. Data collection instruments which were used were the Int...